Library Imports

from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

Template

spark = (
    SparkSession.builder
    .master("local")
    .appName("Exploring Joins")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

sc = spark.sparkContext

Initial Datasets

pets = spark.createDataFrame(
    [
        (1, 1, 'Bear', 5),
        (2, 1, 'Chewie', 10),
        (3, 2, 'Roger', 15),
    ], ['id', 'breed_id', 'nickname', 'age']
)

pets.toPandas()
   id  breed_id nickname  age
0   1         1     Bear    5
1   2         1   Chewie   10
2   3         2    Roger   15

groupby_columns = ['breed_id']

Option 1: Using a Dictionary

df_1 = (
    pets
    .groupby(groupby_columns)
    .agg({
        "*": "count",
        "age": "sum",
    })
)

df_1.toPandas()
   breed_id  count(1)  sum(age)
0         1         2        15
1         2         1        15

What Happened:

  • Very similar to the pandas agg function.
  • The resultant column names are a bit awkward to use after the fact (see the sketch below).
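
A minimal sketch (not part of the original walkthrough) of what "awkward" looks like in practice: the generated names contain parentheses, so they must be backtick-escaped in SQL expressions or renamed before further use.

df_1.selectExpr(
    "breed_id",
    "`count(1)` AS count_of_breeds",
    "`sum(age)` AS total_age_of_breeds",
)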

Option 2: Using List of Columns

df_2 = (
    pets
    .groupby(groupby_columns)
    .agg(
        F.count("*"),
        F.sum("age"),
    )
)

df_2.toPandas()
   breed_id  count(1)  sum(age)
0         1         2        15
1         2         1        15

What Happened:

  • Here we use the Spark aggregate functions from pyspark.sql.functions (see the sketch below).
  • Again, the resultant column names are a bit awkward to use after the fact.
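
Because these aggregates are ordinary Column objects, they can also be collected into a Python list first and unpacked into agg. A minimal sketch, with assumed variable names not taken from the original:

# Build the aggregate expressions up front, then unpack them into agg().
aggregate_columns = [
    F.count("*"),
    F.sum("age"),
]

df_2_alt = (
    pets
    .groupby(groupby_columns)
    .agg(*aggregate_columns)
)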

Option 3: Using List of Columns, with Aliases

df_3 = (
    pets 
    .groupby(groupby_columns)
    .agg(
        F.count("*").alias("count_of_breeds"),
        F.sum("age").alias("total_age_of_breeds"),
    )
)

df_3.toPandas()
   breed_id  count_of_breeds  total_age_of_breeds
0         1                2                   15
1         2                1                   15

What Happened:

  • Here we use the Spark aggregate functions and alias the resultant columns to new names.
  • This provides cleaner column names that we can use later on (see the sketch below).
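
A hypothetical downstream step (not from the original) showing the benefit: the aliased names can be referenced directly, with no escaping or renaming needed.

(
    df_3
    .where(F.col("count_of_breeds") > 1)  # filter on the aliased count
    .orderBy("total_age_of_breeds")       # sort by the aliased sum
)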

Summary

I encourage using option #3.

This creates more elegant and meaningful names for the new aggregate columns.

A withColumnRenamed can be performed after the aggregation, but why not do it with an alias? It's easier as well.
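
For comparison, a sketch (illustrative only, not from the original) of the rename-after approach applied to the auto-generated names from option #2:

df_renamed = (
    df_2
    .withColumnRenamed("count(1)", "count_of_breeds")
    .withColumnRenamed("sum(age)", "total_age_of_breeds")
)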
